PROTAN: Readme.first 


Saturday, November 11, 2017: Issue 1: FOREWORD 

PROTAN is a content-analytic engine. PROTAN does the many tedious tasks that
a human being can do but avoids doing, like counting words. Often, without
further notice, PROTAN will do its job "by default", that is, by assuming that
parameters have the values given initially to the system. For instance, many
of the system's tasks need as little information as a semicolon; PROTAN draws
the implicit information from its memory. PROTAN never, however, does
automatic content analysis. With PROTAN, you must think. At least a little.

This "README" file aims to introduce you to PROTAN. It also tells you how
the rest of this presentation of PROTAN is organized in these pages.


WHAT THIS NOTE IS NOT 


This note is not a substitute for the bulky 265-page User's Manual (in French,
en plus). Nor does this note let you test the system. However, you will find
here some typical results that PROTAN produces.

PROTAN needs a text: titles of a scientific journal through its publication years,
poetry, terrorist pamphlets, or advertising blurbs. I picked a text that everybody
knows, so that when I apply PROTAN to it, most people will have an intuitive
understanding of what the system does to a text. Hamlet is that text.

THE AIMS OF THE SYSTEM 


PROTAN is tuned to two different tasks. In the first one, PROTAN addresses the
question of what the text looks like. Is it abstract, does it become ever more
abstract, or less? What is the profile of the main connotations in the text? In
Hamlet, I will show that the general mood progresses as an inverted U, with the
second branch of the U going much lower than the first one. Such a finding cuts
no ice. This is the reason I picked this text.

The second task that PROTAN is tuned to is to answer the question of what the
text is talking about. What are the main themes in it? A theme, like any interest,
is never fixed. We usually want to know how the interests in a text come and go.

The trick of PROTAN, as of Iker's WORDS 2 system from which I got the idea, is
to postulate that there is enough information in the relations between words to
allow themes to emerge simply by analyzing these relations.

TOOLS 


To carry out its tasks, PROTAN relies on three tools: the partition, the
lemmatization, and the dictionaries.

Partition means just what it says. We divide the text into as many parts as we
see fit. If possible, these parts should be meaningful, that is, letters, chapters
of a book, acts of a play. We can also divide the text into artificial units, for
example parts of 700 words each, or we may have reasons to divide the text into
20 equal parts.

One program takes care of the job of slicing. Its name is CSCUT. You will meet 
it. This program can be complex. This step must be taken with care. Further 
analyses depend on it. 

Lemmatization is a barbarous word for the operation by which the various endings
of words (plurals, conjugations) are transformed into a simpler form, for example,
the infinitive for verbs.

Dictionaries are systems of categories (great dimensions of the mind) that an 
analyst can be interested in. PROTAN is equipped with several such dictionaries 
in different languages. In the analysis of Hamlet, I will use the Dictionary of 
Affect by Whissell and her associates 3 . 

PLATFORMS 


In its 2017 version, PROTAN works only in Windows. 


SOP 


(Standard Operating Procedures) 


PROTAN is composed of 30 programs. These programs are modular. This means 
that each of them has a specific role in a logical chain. For instance, program
CRWSTRIP, which lemmatizes words, takes its input from program CSCUT (the one
that takes care of slicing texts) and produces an output (a system file) to be
crunched by other programs.

All programs produce at least one output, that is, a listing of results. Occasionally,
programs produce further outputs besides the listing: a system file ready to be
used by the next program, a numeric file to be processed by some statistical
package, or both. In my analysis of Hamlet, the output from the comparison
between text and dictionary is sent to SAS for polynomial analysis. I did not
equip PROTAN with statistical software.

A LIST OF PROGRAMS 


Following is a list of programs that are part of PROTAN. These are the 
operations the system can do. Not all these programs are necessary to have a 
successful run. Many of these programs are for creating or editing dictionaries, or 
striplists, or for editing the text. For convenience, the list is alphabetical. 

CDCHECK: creates a dictionary. 

CDLISTA/CDLISTC: prints dictionary alphabetically or by category. 

CDWJUXT: reports on the co-occurrences of categories from two different dictionaries in the 
same text. 

CDWLOOK: reports on the occurrences of categories from one dictionary at a time. 

CFCHECK: used in finding specific occurrences of specific words. 

CFLISTA: prints the FINDS-type file created by program CFCHECK. 

CFWKWIC: prints the context of specific occurrences of specific words.

CPEXCOR: picks among the most richly associative words in a series.

CPFACTOR: factor analyzes a matrix of words by observations.

CRCHECK: creates a striplist (lemmatization).

CRLISTA: prints a striplist. 

CRWSTRIP: strips texts. 

CSCHECK: checks whether the text has been entered correctly for the syntax of PROTAN.

CSCUT: slices text.

CSEDIT: edits text.

CSJOIN: concatenates texts. 

CSSORT: sorts information. 

CWADD: adds extraneous information into a system file created by PROTAN.

CWEDIT: edits texts. 

CWFLOW: computes moving averages of the number of new words appearing in successive
intervals of text.

CWKWIC: a key-word-in-context program. 

CWKWOC: a key-word-out-of-context program. 

CWNEW: reports on new words appearing in the text (and on old ones disappearing).

CWPAT: looks for patterns in sequences of words.

CWREFER: creates a reference list of words. 

CWSELECT: creates an observations-by-words frequency matrix.

CWTALLY: tallies words with their frequencies. 

CWWCOL: compares word frequencies for two series of texts.

CWWORD: compares word frequencies for two series of texts about a set of words. 


WHAT'S NEXT? 


The rest of this presentation consists of five parts. File01 shows what the text
looks like in PROTAN. It also shows what CSCHECK does to a text. File02
pictures one way of slicing a text with CSCUT. File03 shows what happens when
PROTAN strips a text. File04 pictures the results of a comparison between the
text and a dictionary of affect. Finally, File05 shows how the system handles
the word-word correlations in a text.


PROTAN: Hamlet (a source text for PROTAN)

Saturday, 11 November 2017: Issue File01.com 

We consider two topics in this FILE01.COM note. The first one is the source text 
to be analyzed; the second one is the listing that results from running the program 
CSCHECK. 

Hamlet: source text 


File HAMLET.SOU shows the first 20 and the last 20 lines of the drama. The text
has been entered as in the book. You will notice one peculiarity: the text
occupies columns 1 to 70. Columns 71 to 80 are reserved for marks of interview,
unit, and speaker. These marks mean what you decide they mean. You can indeed
divide a text by interviews, by units within these interviews, and by speakers
within these interviews or units. In the present case, I decided to create one
single undivided segment that is the drama itself. More about this when I
introduce you to program CSCUT, which slices texts.
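
The column layout just described can be sketched in a few lines of Python. This is only an illustration, not PROTAN code. I assume here that the marks read like "001 001a" in the HAMLET.SOU example, that is, an interview number, then a unit number immediately followed by a one-character speaker code; the names `SourceLine` and `parse_source_line` are hypothetical.

```python
from typing import NamedTuple


class SourceLine(NamedTuple):
    text: str
    interview: str
    unit: str
    speaker: str


def parse_source_line(line: str) -> SourceLine:
    """Split one source line into its text field and its marks."""
    padded = line.ljust(80)
    text = padded[:70].rstrip()    # columns 1 to 70: the text itself
    marks = padded[70:80].split()  # columns 71 to 80: the marks
    if len(marks) != 2:
        raise ValueError(f"expected two marks in columns 71-80, got {marks!r}")
    interview, unit_speaker = marks
    # Assumption: the last character of the second mark is the speaker code.
    return SourceLine(text, interview, unit_speaker[:-1], unit_speaker[-1])


line = "WHO IS THERE".ljust(70) + "001 001a"
print(parse_source_line(line))
```

A real checker such as CSCHECK would also have to deal with the comment lines at the top of the file; this sketch leaves them out.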

Notice the first two lines of file HAMLET.SOU. These lines identify the name of
the text; they can also hold any comments you would like to go with the text.

As it stands, the file HAMLET.SOU is ready to be sent to program CSCHECK. The
purpose of CSCHECK is to check that columns 71 to 80 have been correctly
entered and that no "weird" characters have been introduced accidentally in
the text.

Program CSCHECK 

This program does little more than produce a listing. The listing says either
that the text is correctly entered, or that it has anomalies in it. You must correct
these anomalies if you want to continue safely with the other programs.

Notice too, at the beginning of the listing, the statement "SENT='.!;?:'". It tells
you that PROTAN used the default choices for deciding when a chain of words
was a sentence. This option can be changed to fit your preferences.
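
To see what such a delimiter set does, here is a small Python sketch (mine, not PROTAN's). It counts as a sentence every chain of words closed by one of the delimiter characters. Counting a trailing chain with no closing delimiter as a sentence is my assumption; the real program may decide otherwise.

```python
def count_sentences(text: str, delimiters: str = ".!;?:") -> int:
    """Count chains of words closed by one of the delimiter characters."""
    count = 0
    has_words = False
    for ch in text:
        if ch in delimiters:
            if has_words:          # ignore delimiters that close nothing
                count += 1
            has_words = False
        elif ch.isalnum():
            has_words = True
    # Assumption: words left over after the last delimiter form a sentence too.
    return count + (1 if has_words else 0)


sample = "NAY ANSWER ME: STAND UNFOLD YOURSELF. LONG LIVE THE KING."
print(count_sentences(sample))          # colon counts as a sentence end
print(count_sentences(sample, ".?!"))   # colon no longer counts
```

Changing the delimiter set changes the number of sentences, and with it every statistic built on sentence length.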

At the end of the listing, CSCHECK gives you preliminary information about the
text (total number of words).


HAMLET.SOU

*source-hamlet 
*endcom 


HAMLET                                                                001 001a
WHO IS THERE                                                          001 001a
NAY ANSWER ME: STAND UNFOLD                                           001 001a
YOURSELF                                                              001 001a
LONG LIVE THE KING.                                                   001 001a
BARNARDO                                                              001 001a
HE.                                                                   001 001a
YOU COME MOST CAREFULLY UPON YOUR HOUR.                               001 001a
TIS NOW STRUCK TWELVE, GET THEE TO BED                                001 001a
FRANCISCO.                                                            001 001a
FOR THIS RELIEF MUCH THANKS: TIS BITTER COLD,                         001 001a
AND I AM SICK AT HEART.                                               001 001a
HAVE YOU HAD QUIET GUARD                                              001 001a
NOT A MOUSE STIRRING.                                                 001 001a
WELL, GOODNIGHT. IF YOU DO MEET HORATIO AND                           001 001a
MARCELLUS, THE RIVALS OF MY WATCH, BID                                001 001a
THEM MAKE HASTE.                                                      001 001a
I THINK I HEAR THEM. STAND: WHO IS THERE                              001 001a
FRIENDS TO THIS GROUND.                                               001 001a
AND LIEGE MEN TO THE DANE.                                            001 001a

**************************************************
*                                                *
* and so on until the last lines as follows ...  *
*                                                *
**************************************************

FOR ME, WITH SORROW, I EMBRACE MY FORTUNE,                            001 001a
I HAVE SOME RIGHTS OF MEMORY IN THIS KINGDOM,                         001 001a
WHICH ARE TO CLAIM, MY VANTAGE DOTH                                   001 001a
INVITE ME,                                                            001 001a
OF THAT I SHALL HAVE ALWAYS CAUSE TO SPEAK,                           001 001a
AND FROM HIS MOUTH                                                    001 001a
WHOSE VOICE WILL DRAW ON MORE                                         001 001a
BUT LET THIS SAME BE PRESENTLY PERFORMED,                             001 001a
EVEN WHILE MENS MINDS ARE WILD,                                       001 001a
LEST MORE MISCHANCE                                                   001 001a
ON PLOTS, AND ERRORS HAPPEN.                                          001 001a
LET FOUR CAPTAINS                                                     001 001a
BEAR HAMLET LIKE A SOLDIER TO THE STAGE,                              001 001a
FOR HE WAS LIKELY, HAD HE BEEN PUT ON                                 001 001a
TO HAVE PROVED MOST ROYALLY:                                          001 001a
AND FOR HIS PASSAGE,                                                  001 001a
THE SOLDIERS MUSIC, AND THE RITES OF WAR                              001 001a
SPEAK LOUDLY FOR HIM.                                                 001 001a
TAKE UP THE BODY; SUCH A SIGHT AS THIS                                001 001a
BECOMES THE FIELD, BUT HERE SHOWS MUCH AMISS.                         001 001a


HAMLETsch.LIS


PL=60; 

SLISTING='dd:sysprint'; 
SMASTER='dd:master'; 


Hamlet 

PROTAN: A Content-Analytic Engine 
A Demonstration 


Robert Hogenraad 


** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE PAGE: 1 **
** PROGRAM: CSCHECK DATE: Apr29 94 TIME: 14:25:44 **

OPTIONS IN EFFECT

COMM=0;
PRINT=0;
SENT=
SSOURCE='*';

** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE PAGE: 2 **
** PROGRAM: CSCHECK DATE: Apr29 94 TIME: 14:25:44 **

"SOURCE" FILE DEFINITIONS



"SOURCE" FILE IDENTIFICATION (USER-DEFINED). hamlet 

USER'S COMMENT. 

GENERAL COMMENTS (HISTORY OF "SOURCE" & "WORDS" FILES AND OTHER COMMENTS) 


*CSCH149I end of general comments 

** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE PAGE: 3 ** 
** PROGRAM: CSCHECK DATE: Apr29 94 TIME: 14:25:44 ** 
** hamlet ** 
INPUT SOURCE TEXT                                                     I U S


*CSCH023I end of data set on 'HAMLET1.SOURCE.A' 
lines read from 1 to 417 
try to open the next data set if any 
*CSCH023I end of data set on 'HAMLET2.SOURCE.A' 
lines read from 418 to 917 
try to open the next data set if any 
*CSCH023I end of data set on 'HAMLET3.SOURCE.A' 
lines read from 918 to 1417 
try to open the next data set if any 
*CSCH023I end of data set on 'HAMLET4.SOURCE.A' 
lines read from 1418 to 1917 
try to open the next data set if any 
*CSCH023I end of data set on 'HAMLET5.SOURCE.A' 
lines read from 1918 to 2417 
try to open the next data set if any 
*CSCH023I end of data set on 'HAMLET6.SOURCE.A' 
lines read from 2418 to 2917 
try to open the next data set if any 
*CSCH023I end of data set on 'HAMLET7.SOURCE.A' 
lines read from 2918 to 3417 
try to open the next data set if any 
*CSCH023I end of data set on 'HAMLET8.SOURCE.A' 
lines read from 3418 to 3680 
try to open the next data set if any 
*CSCH062I end of data set chain reached when opening '*' 


source file complete 

TOTAL NUMBER OF WORDS. 27421 

TOTAL NUMBER OF MARKERS.... 0 
TOTAL NUMBER OF SENTENCES.. 2118 

TOTAL NUMBER OF ITEMS. 32299

TOTAL NUMBER OF INTERVIEWS. 1 
TOTAL NUMBER OF INT/UNITS.. 1 
TOTAL NUMBER OF SPEAKERS... 1 
SPEAKER CODES USED : 


a 

** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE PAGE: 4 **
** PROGRAM: CSCHECK DATE: Apr29 94 TIME: 14:26:02 **

*CYEX001I processing terminated
normal end of job


PROTAN: Segmenting a text
Saturday, 11 November 2017: Issue File02.com

The CSCUT program 

Program CSCUT has two main functions. First, it slices a text, or a series of
texts, into parts. Second, it turns a series of texts chained one to another into
one single system file that the other programs can handle easily.

Remember that the source text has columns 71 to 80 reserved for alphanumeric
marks. Program CSCUT can use this information to create, say, 20 slices out of
the 20 interviews a text has been divided into. I could also have created 20
slices out of 5 interviews each made of 4 units. Or I could have used the natural
partition of Hamlet into 5 acts, the acts being divided respectively into 5, 2, 4,
7, and 2 scenes. I am sure you get the idea.

Alternatively, you could set the Interviews, Units and Speakers at "001 001a", as 
in the HAMLET.SOU example, and divide the source text into 20 equal parts. 

HAMscu.LIS


The HAMscu.LIS file shows exactly that: the whole drama was divided into 20
equal parts of 1,371 words each, except the last slice, which has 1,372 words.
My reasoning was the following: if a phenomenon, say pleasantness, is strong
enough, it should show however you slice the text.
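
The arithmetic of this equal slicing is easy to reproduce. The Python sketch below is an illustration only, not CSCUT itself. It cuts a list of words into a given number of equal parts and adds the leftover words to the last part, which is how I read the MODULO=20 and RESIDUE='MERGE' options in the listing; the function name `slice_equal` is hypothetical.

```python
def slice_equal(words, parts):
    """Cut `words` into `parts` equal slices; leftovers join the last slice."""
    size, residue = divmod(len(words), parts)
    slices = [words[i * size:(i + 1) * size] for i in range(parts)]
    if residue:
        # Assumption: the residue is merged into the final slice.
        slices[-1] = slices[-1] + words[parts * size:]
    return slices


# 27,421 words in 20 parts: 19 slices of 1,371 words and one of 1,372.
words = ["w"] * 27421
print([len(s) for s in slice_equal(words, 20)])
```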

A few more statistics are provided, like the type-token ratio, the average sentence
length, the average word length, the percentage of words of 9 letters or more, and
the Gunning index of readability. The last piece of information concerns the
distribution of sentence lengths: how many sentences have between 1 and 5 words,
how many between 6 and 10 words, and so on.
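
For readers who like to check figures, here is a Python sketch of three of these statistics. It is not PROTAN code. The formula used for the Gunning index, 0.4 times the sum of the average sentence length and the percentage of long words, is my reading of the listing: it reproduces the printed values (for segment 1, 0.4 x (21.422 + 4.012) gives about 10.17), but the manual may define the index in other terms.

```python
def type_token_ratio(words):
    """Number of different words divided by the total number of words."""
    return len(set(words)) / len(words)


def percent_long_words(words, min_length=9):
    """Percentage of words with at least `min_length` letters."""
    return 100.0 * sum(len(w) >= min_length for w in words) / len(words)


def gunning_index(avg_sentence_length, pct_long_words):
    """Assumed formula: 0.4 * (average sentence length + % long words)."""
    return 0.4 * (avg_sentence_length + pct_long_words)


words = "YOU COME MOST CAREFULLY UPON YOUR HOUR".split()
print(type_token_ratio(words), percent_long_words(words))
print(gunning_index(21.422, 4.012))  # segment 1 of the listing
```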

Notice the indications of date and time at the top of the various pages of the
listing. Some options need more processing time than others.


HAMscu.LIS

PL=60; 

SLISTING='dd:sysprint' ; 
SMASTER='dd:master' ; 


Hamlet 

PROTAN: A Content-Analytic Engine 
A Demonstration 
Robert Hogenraad 


** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE PAGE: 1 **
** PROGRAM: CSCUT DATE: Apr29 94 TIME: 14:26:18 **

GENERAL OPTIONS IN EFFECT

COMM=0;
PRINT=0;
PSTAT=0;
PUNCH=0;
SPSTAT='dd:punstat';
SPUNCH='dd:syspunch';
SSOURCE='*';
STAT=1;
STEMP='dd:*';
SWORDS='dd:words';
TABLE=0;
WORD=1;

PROCESSING OPTIONS IN EFFECT

BRKB='S';
BRKM=1;
CTRL=0;
MODULO=20;
NARR=1;
REGEN=0;
RESIDUE='MERGE';
SEGT='';
SENT='.?]';

** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE PAGE: 2 **
** PROGRAM: CSCUT DATE: Apr29 94 TIME: 14:26:18 **

"SOURCE" FILE DEFINITIONS

"SOURCE" FILE IDENTIFICATION (USER-DEFINED). HAMLET
OUTPUT FILES CREATION DATE. Apr29 94
OUTPUT FILES CREATION TIME. 14:26:18

USER'S COMMENT. 


GENERAL COMMENTS (HISTORY OF "SOURCE" FILES AND OTHER COMMENTS) 


*CSCU149I end of general comments 

** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE PAGE: **
** PROGRAM: CSCUT DATE: Apr29 94 TIME: 14:26:19 **
** HAMLET **
INPUT SOURCE TEXT                                                     I U S

*CSCU023I end of data set on 'HAMLET1.SOURCE.A'
lines read from 1 to 417
try to open the next data set if any
*CSCU023I end of data set on 'HAMLET2.SOURCE.A'
lines read from 418 to 917
try to open the next data set if any
*CSCU023I end of data set on 'HAMLET3.SOURCE.A'
lines read from 918 to 1417
try to open the next data set if any
*CSCU023I end of data set on 'HAMLET4.SOURCE.A'
lines read from 1418 to 1917
try to open the next data set if any
*CSCU023I end of data set on 'HAMLET5.SOURCE.A'
lines read from 1918 to 2417
try to open the next data set if any
*CSCU023I end of data set on 'HAMLET6.SOURCE.A'
lines read from 2418 to 2917
try to open the next data set if any
*CSCU023I end of data set on 'HAMLET7.SOURCE.A'
lines read from 2918 to 3417
try to open the next data set if any
*CSCU023I end of data set on 'HAMLET8.SOURCE.A'
lines read from 3418 to 3680
try to open the next data set if any
*CSCU062I end of data set chain reached when opening '*'
source file complete
NUMBER OF SEGMENTS GENERATED : 1 

NUMBER OF SEGMENTS GENERATED BY MODULO/REGEN : 20 

** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE PAGE: 4 ** 

** PROGRAM: CSCUT DATE: Apr29 94 TIME: 14:54:26 ** 

** HAMLET ** 

SORT STATISTICS - PART 1

SEG     DIFF    TOTAL     T.T.
       WORDS    WORDS    RATIO

  1      556     1371    0.406
  2      515     1371    0.376
  3      537     1371    0.392
  4      530     1371    0.387
  5      502     1371    0.366
  6      506     1371    0.369
  7      497     1371    0.363
  8      572     1371    0.417
  9      566     1371    0.413
 10      546     1371    0.398
 11      588     1371    0.429
 12      538     1371    0.392
 13      555     1371    0.405
 14      507     1371    0.370
 15      543     1371    0.396
 16      499     1371    0.364
 17      536     1371    0.391
 18      534     1371    0.389
 19      539     1371    0.393
 20      532     1372    0.388
TOT     4227    27421    0.154

** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE **
** PROGRAM: CSCUT DATE: Apr29 94 TIME: 14:54 **
** HAMLET **

SORT STATISTICS - PART 2

SEG   AV. SENT.   SD SENT.   AV. WORD   SD WORD   % WORDS   GUNNING
NO       LENGTH     LENGTH     LENGTH    LENGTH   L. >= 9     INDEX

  1      21.422     24.575      4.140     2.023     4.012    10.173
  2      22.850     17.632      4.070     2.030     3.428    10.511
  3      21.762     19.829      4.028     2.020     3.501    10.105
  4      17.577     27.098      3.992     1.935     2.918     8.198
  5      24.053     23.973      3.956     2.118     4.887    11.576
  6      20.463     20.751      3.945     2.001     3.866     9.731
  7      23.638     22.235      4.055     2.128     4.449    11.235
  8      21.422     23.578      4.110     1.976     2.918     9.736
  9      21.422     23.303      4.052     2.065     4.012    10.173
 10      18.781     16.919      4.065     2.095     4.741     9.409
 11      16.518     14.579      4.104     1.902     2.553     7.628
 12      21.092     20.534      3.926     1.943     3.209     9.721
 13      19.042     15.261      4.008     1.888     2.261     8.521
 14      18.527     16.321      3.985     1.951     3.647     8.870
 15      17.137     14.472      4.042     1.905     2.918     8.022
 16      25.868     26.678      3.927     1.904     2.918    11.514
 17      29.804     29.874      4.007     1.967     3.647    13.381
 18      20.463     14.074      3.920     1.857     2.699     9.265
 19      28.563     25.303      4.010     2.054     3.793    12.942
 20      14.442     14.047      4.022     1.985     3.134     7.030
TOT      20.602     20.843      4.018     1.990     3.475     9.631

** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE PAGE: **
** PROGRAM: CSCUT DATE: Apr29 94 TIME: 14:54:26 **
** HAMLET **

SENTENCES     FREQ         %
LENGTH

  1 -   5      258    19.665
  6 -  10      257    19.588
 11 -  15      182    13.872
 16 -  20      155    11.814
 21 -  25      106     8.079
 26 -  30       76     5.793
 31 -  35       52     3.963
 36 -  40       46     3.506
 41 -  45       49     3.735
 46 -  50       28     2.134
 51 -  55       11     0.838
 56 -  60       21     1.601
 61 -  65       11     0.838
 66 -  70       12     0.915
 71 -  75       16     1.220
 76 -  80        2     0.152
 81 -  85        2     0.152
 86 -  90        2     0.152
 91 -  95        4     0.305
 96 - 100        4     0.305
101 - 105        6     0.457
106 - 110        1     0.076
111 - 115        1     0.076
116 - 120        1     0.076
121 & >          9     0.686
TOTAL         1312

[GRAPH column: one bar of asterisks per row, not reproduced here.]



** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE PAGE: 7 **
** PROGRAM: CSCUT DATE: Apr29 94 TIME: 14:54 **

*CYEX001I processing terminated
normal end of job


PROTAN: Stripping a text
Saturday, 11 November 2017: Issue File03.com

The CRWSTRIP program

CRWSTRIP strips. Words have different endings for adjectives and adverbs, for
first, second, and third persons, singular or plural, etc. We want to transform
these endings into simpler forms. Why indeed have one "travelling", another
"travelled", and still another "travel"? The world would be simpler if we had one
"travel" for all these various forms (and piled up all their separate frequencies
on that one word "travel").

The program CRWSTRIP does that and nothing else. The system file that CSCUT 
created is read as input by CRWSTRIP, stripped of all its "appendices", and 
reappears as another system file. 
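
The idea of stripping can be sketched in Python as a plain word-to-word substitution. This is an illustration, not CRWSTRIP: the three-entry striplist below is hypothetical, and the real striplists also hold root-based substitutions, which the sketch ignores.

```python
from collections import Counter

# Hypothetical striplist: each inflected form points to its simpler form.
STRIPLIST = {
    "TRAVELLING": "TRAVEL",
    "TRAVELLED": "TRAVEL",
    "TRAVELS": "TRAVEL",
}


def strip_words(words, striplist):
    """Replace each word by its simpler form; unknown words stay as they are."""
    return [striplist.get(word, word) for word in words]


words = "HE TRAVELLED FAR AND IS TRAVELLING STILL TO TRAVEL".split()
stripped = strip_words(words, STRIPLIST)

before, after = Counter(words), Counter(stripped)
print(len(words), len(stripped))  # the total number of words is unchanged
print(len(before), len(after))    # the number of different words drops
print(after["TRAVEL"])            # frequencies pile up on one form
```

No word is deleted: only the number of different words goes down, which is the pattern the listing reports for Hamlet.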

HAMrws.LIS 


In HAMrws.LIS, you see (in bold, toward the end of the listing) that the
HAMLET.SOU text is composed of 27,421 words and 4,227 different words.
After stripping, the total number of words remains the same (no word has been
deleted), but the number of different words has dropped from 4,227 to a mere 3,483.

After this step, we are ready to submit the stripped system file to the next
programs, such as CDWLOOK, which compares the text with a dictionary. More
anon.


HAMrws.LIS


PL=60; 

SLISTING='dd:sysprint'; 
SMASTER='dd:master'; 


Hamlet 

PROTAN: A Content-Analytic Engine 
A Demonstration 


Robert Hogenraad 


** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE PAGE: 1 **
** PROGRAM: CRWSTRIP DATE: Apr29 94 TIME: 14:54:41 **

OPTIONS IN EFFECT

PUNCH=0;
ROOTT=0;
SPUNCH='dd:syspunch';
SROOTS='dd:roots';
STAT=1;
STEMP='dd:*';
SWORDS='dd:words';
SWORDSO='dd:wordso';
TABLE=0;
WORD=1;

** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE PAGE: 2 **
** PROGRAM: CRWSTRIP DATE: Apr29 94 TIME: 14:54:42 **

"ROOTS" FILE DEFINITIONS



"ROOTS" FILE IDENTIFICATION (USER-DEFINED). AESTRP01 

"ROOTS" FILE CREATION DATE. Apr06 92 

"ROOTS" FILE CREATION TIME. 14:30:38 


USER'S COMMENT. LISTE DE FORMES NOMINALES DE L'ANGLAIS (AMERICAN-ENGLISH) 
GENERAL COMMENTS 


AE011 - AE018 : STRIPLIST ORIGINAL IKER'S WORDS 
AE019 : ISSU DE "MARY CALKINS" AVRIL 1992 
STRIPLIST STATISTICS 


NUMBER OF WORD TO WORD SUBSTITUTIONS. 6489 
NUMBER OF ROOT TO WORD SUBSTITUTIONS. 25 
NUMBER OF ROOT TO ROOT SUBSTITUTIONS. 0 

TOTAL NUMBER OF SUBSTITUTIONS. 6514 

NUMBER OF WORDS TO SUPPRESS. 0 

NUMBER OF ROOTS TO SUPPRESS. 0 

TOTAL NUMBER OF SUPPRESSIONS. 0 

TOTAL NUMBER OF WORDS/ROOTS. 6514 

** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE PAGE: 3 ** 

** PROGRAM: CRWSTRIP DATE: Apr29 94 TIME: 14:54:54 ** 

** AESTRP01 LISTE DE FORMES NOMINALES DE L'ANGLAIS (AMERICAN-ENGLISH) ** 

"WORDS" FILE DEFINITIONS 


"WORDS" FILE IDENTIFICATION (USER-DEFINED). HAMLET 

"WORDS" FILE CREATION DATE. Apr29 94 

"WORDS" FILE CREATION TIME. 14:26:18 

OUTPUT FILES CREATION DATE. Apr29 94

OUTPUT FILES CREATION TIME. 14:54:54 

USER'S COMMENT. 

INPUT "WORDS" FILE PROCESSING LEVEL.. 0 
PROCESSING OPTIONS PROVIDED TO "CSCUT" 


BRKB='S'; 

BRKM=1; 

CTRL=0; 

MODULO=20; 

NARR=1; 

REGEN=0; 

RESIDUE='MERGE'; 

SEGT=''; 

SENT='.?]'; 

NUMBER OF SEGMENTS GENERATED : 20 


GENERAL COMMENTS (HISTORY OF "SOURCE" & "WORDS" FILES AND OTHER COMMENTS) 


PROC BY CSCUT "HAMLET " (Apr29 94 14:26:18) 

** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE PAGE: 4 **
** PROGRAM: CRWSTRIP DATE: Apr29 94 TIME: 14:58:55 **
** AESTRP01 LISTE DE FORMES NOMINALES DE L'ANGLAIS (AMERICAN-ENGLISH) **
** HAMLET **

        INPUT SORT STATISTICS        OUTPUT SORT STATISTICS
SEG     DIFF    TOTAL     T.T.       DIFF    TOTAL     T.T.
       WORDS    WORDS    RATIO      WORDS    WORDS    RATIO

  1      556     1371    0.406        489     1371    0.357
  2      515     1371    0.376        447     1371    0.326
  3      537     1371    0.392        470     1371    0.343
  4      530     1371    0.387        478     1371    0.349
  5      502     1371    0.366        422     1371    0.308
  6      506     1371    0.369        427     1371    0.311
  7      497     1371    0.363        428     1371    0.312
  8      572     1371    0.417        492     1371    0.359
  9      566     1371    0.413        489     1371    0.357
 10      546     1371    0.398        472     1371    0.344
 11      588     1371    0.429        502     1371    0.366
 12      538     1371    0.392        467     1371    0.341
 13      555     1371    0.405        486     1371    0.354
 14      507     1371    0.370        445     1371    0.325
 15      543     1371    0.396        479     1371    0.349
 16      499     1371    0.364        434     1371    0.317
 17      536     1371    0.391        471     1371    0.344
 18      534     1371    0.389        467     1371    0.341
 19      539     1371    0.393        478     1371    0.349
 20      532     1372    0.388        466     1372    0.340
TOT     4227    27421    0.154       3483    27421    0.127
LVL 0   4227    27421    0.154




** PROTAN SYSTEM UCL/PSP AT LOUVAIN-LA-NEUVE PAGE: **
** PROGRAM: CRWSTRIP DATE: Apr29 94 TIME: 14:58:55 **

*CYEX001I processing terminated
normal end of job



PROTAN: Dictionary comparisons 
Saturday, 11 November 2017: Issue File04.com 


CAN COMPUTERS BEAT LITERARY ANALYSTS? 


The previous program, CRWSTRIP, gave us a text cleaned of most of its
inflections. At this point, we are ready to begin analyzing the drama of Hamlet.

From experience, culture, and reading, we know that Hamlet is a sad drama. That
is what dramas usually are. Is it sad all the way through the text? What is the
saddest part in Hamlet, or the least sad? Literary critics may have commented on
these questions. Could a computer arrive at the same result? There lies both the
core of content analysis and the core of its ethics. It is indeed appropriate to
speak of ethics when a device is used to bring myths down to facts.

A DICTIONARY OF AFFECT 


To answer these questions, we have a dictionary. This dictionary is composed of
several thousand words (4,000 precisely) 1. We have the ratings collected from
several hundred judges on each of these words. Judges had to decide (on a
7-point scale) whether a word was pleasant (rating of 7) or unpleasant (rating of 1).

We reason that the more a text is composed of words high in pleasantness, the 
more the text can be declared pleasant. The reverse is true too. 
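
This reasoning can be sketched in Python as an average of word ratings. The sketch is mine, not CDWLOOK's: the ratings below are invented for the example (they are not Whissell's), and averaging the rated words of a segment is only one plausible way of scoring it.

```python
# Hypothetical ratings on the 7-point scale (1 = unpleasant, 7 = pleasant).
RATINGS = {"LOVE": 6.5, "HEAVEN": 6.0, "GRAVE": 2.0, "FOUL": 1.5}


def mean_pleasantness(words, ratings):
    """Average rating of the words found in the dictionary, or None if none are."""
    scores = [ratings[word] for word in words if word in ratings]
    return sum(scores) / len(scores) if scores else None


print(mean_pleasantness("LOVE AND HEAVEN".split(), RATINGS))
print(mean_pleasantness("THE FOUL GRAVE".split(), RATINGS))
```

Words absent from the dictionary are simply skipped here, so a segment with very few rated words gives a fragile score.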

File HAMev.LIS 


When you look at the graph in file HAMev.LIS, you see that the drama is not
immediately a sad one. It even starts by growing in pleasantness. From a certain
point on, destiny begins its work. You can follow the evolution of Hamlet's
disgust from the beginning to the end of the drama. Each plotted point corresponds
to the predicted value of sadness in one of the successive slices of 1,371 words.

For those interested, the quadratic profile in file HAMLETev.LIS explains 47% of
the variance of the observed scores [R² = .47, F(2, 17) = 7.65, p < .004].
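
A quadratic profile of this kind can be fitted with ordinary least squares. The Python sketch below stands in for the SAS step and is not what SAS does internally: it solves the normal equations by elimination, which is fine for a low-degree polynomial over 20 segments. The eight scores at the end are invented for the example.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients, constant term first."""
    n = degree + 1
    # Augmented matrix of the normal equations [X'X | X'y].
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(n):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * p for v, p in zip(a[r], a[col])]
    return [a[i][n] / a[i][i] for i in range(n)]


def r_squared(xs, ys, coefs):
    """Share of the variance of ys explained by the fitted polynomial."""
    mean = sum(ys) / len(ys)
    fitted = [sum(c * x ** k for k, c in enumerate(coefs)) for x in xs]
    ss_res = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    ss_tot = sum((y - mean) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Invented scores for eight segments, rising and then falling.
segments = list(range(1, 9))
scores = [44.1, 44.6, 44.9, 45.0, 44.8, 44.5, 44.0, 43.2]
coefs = fit_polynomial(segments, scores, 2)
print(coefs, r_squared(segments, scores, coefs))
```

A negative coefficient on the squared term is what gives the inverted U.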

The above analysis was done (using SAS) on a file produced by CDWLOOK. 
When we analyze the file created by program CDWLOOK, we are not in 
PROTAN any more. Here, I used SAS. Any other statistical software could have 
done the job. 

As a final note, I will add the following clarification. The dictionary of affect
used here is an American-English one. PROTAN is equipped with similar
dictionaries of affect for French and for Portuguese.

PROTAN is equipped with dictionaries other than those of affect. Martindale's
Regressive Imagery Dictionary is one of them: it exists in an American-English
version, a French one, a German one, a Portuguese one, and a Russian one.
Others are in preparation.


A MULTILINGUAL TOOL


PROTAN is multilingual in that sense. I consider this as much a liability as an
advantage, for you need to have a version of a dictionary in each language in
which you plan to work. Dictionaries are not easy to build.

The next and final program, CWSELECT, prepares statistical analyses on texts
irrespective of the language of the text.


HAMLETev.LIS


Plot of HAMTPEVF*SEG. [Plot not reproduced; the only surviving vertical-axis
ticks read 45.00 and 44.75.]


PROTAN: Word-word associations
Saturday, 11 November 2017: Issue File05.com

FORM versus THEMES


We just saw the profile of Hamlet in pleasantness. However, such an inverted
U-shaped profile is not unique to Hamlet. Many texts exhibit a similar profile in
pleasantness (or sadness).

With dictionaries, we analyze formal aspects of texts. That does not tell us what 
the text is about. 

The analysis sketched out here rests on a postulate. The postulate was
formulated by Iker 2 as follows: enough information exists within the
associations among the words of a text to allow data-generated elicitation of major
content themes and material. This final section of the presentation makes full use
of this postulate.

Program CWSELECT 

The purpose of CWSELECT is to create a frequency matrix. One dimension of the
matrix lists the observations (segments) into which the text is divided. The other
dimension lists a selection of words extracted from the words that compose the
text. (I will not describe how the words are chosen. It should be enough to know
that there is a simple way of doing this and a sophisticated way of doing it.)
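
Such a matrix is simple to picture in Python. This sketch is not CWSELECT: it takes the selected words as given and only counts them, one row per segment and one column per word. The two tiny segments are invented for the example.

```python
from collections import Counter


def frequency_matrix(segments, selected_words):
    """One row per segment, one column per selected word, cells are counts."""
    matrix = []
    for segment in segments:
        counts = Counter(segment)
        matrix.append([counts[word] for word in selected_words])
    return matrix


segments = [
    "WATCH THE NIGHT WATCH".split(),
    "HEAVEN AND NIGHT".split(),
]
print(frequency_matrix(segments, ["WATCH", "NIGHT", "HEAVEN"]))
```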

THE MEASUREMENT OF WORD-WORD ASSOCIATIONS. 


Words are then correlated among each other across the observations. The 
correlation matrix is then factor or cluster analyzed. The resulting factors are the 
themes whose variations are the fabric of the text. 
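
The first of these steps, correlating words across observations, can be sketched in Python as follows. The sketch is an illustration only: it computes plain Pearson correlations between the columns of a frequency matrix and stops there, leaving the factor or cluster analysis to a statistical package. Returning 0.0 for a word whose frequency never varies is my own convention.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation between two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # convention chosen here for a constant series
    return cov / sqrt(var_x * var_y)


def correlation_matrix(matrix):
    """Correlate every column (word) with every other across rows (segments)."""
    columns = list(zip(*matrix))
    return [[pearson(a, b) for b in columns] for a in columns]


# Three segments, two words whose frequencies move in opposite directions.
print(correlation_matrix([[1, 3], [2, 2], [3, 1]]))
```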

Very often, the number of words exceeds the number of observations. This is a
purely statistical difficulty, but it is a serious objection to the use of factor
analysis in textual analysis. When necessary, I get around the quandary by using
correspondence analysis à la Greenacre 3.

File HAMfac.LIS. 


File HAMfac.LIS shows extracts from a factor analysis of a matrix of 20 words
and 20 observations. The two factors requested are shown with their Varimax
rotation. The GLM analysis on the factor scores of these two factors reveals that
Factor 2 has a "W" profile. This profile, shown at the end of HAMfac.LIS,
explains 69% of the variance [R² = .69, F(4, 15) = 8.32, p < .001].
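
The F value follows from R² and the degrees of freedom, so the figures in the listing can be checked by hand. The short Python sketch below uses the standard relation F = (R² / df_model) / ((1 - R²) / df_error); the function name `f_from_r2` is mine.

```python
def f_from_r2(r2, df_model, df_error):
    """F statistic of a regression computed from its R-square."""
    return (r2 / df_model) / ((1 - r2) / df_error)


# Values read from the HAMfac.LIS listing.
print(round(f_from_r2(0.689208, 4, 15), 2))  # Factor 2: quartic model
print(round(f_from_r2(0.246288, 5, 14), 2))  # Factor 1: quintic model
```

With the R-square values printed in the listing, this gives back the F values of 8.32 for Factor 2 and 0.91 for Factor 1.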

Sometimes, observing the evolution of factor scores through time helps to make 
sense of the factors. 

I am not an expert in Hamlet. All I can say is that words covary. These words are
watch, Fortinbras, night, spirit, speak, and strike for Factor 2 (with the
characteristic W profile); words that characterize Factor 1 are heaven, hand,
villain, foul, swear, bleed, and Denmark.

Usually, the analyst likes to attach a name to a factor. I tried (not very hard)
and failed. It is not that important for the presentation.


HAMfac.LIS

Rotation Method: Varimax

          FACTOR1     FACTOR2

RF1      -0.18746     0.81859    RF  WATCH
RF2      -0.10368     0.80184    RF  FORTINBRAS
RF3      -0.23778    -0.19408    RF  HUSBAND
RF4      -0.06971    -0.27682    RF  GRAVE
RF5       0.81434     0.07406    RF  HEAVEN
RF6       0.79425    -0.18211    RF  HAND
RF7       0.83877    -0.04252    RF  VILLAIN
RF8      -0.29919    -0.43276    RF  LOVE
RF9       0.90943    -0.07080    RF  FOUL
RF10     -0.02818     0.81590    RF  NIGHT
RF11     -0.24678    -0.20486    RF  DOUBT
RF12      0.11330     0.16114    RF  HAMLET
RF13      0.83542    -0.13520    RF  SWEAR
RF14     -0.32112    -0.19108    RF  LIGHT
RF15      0.06624     0.70067    RF  SPIRIT
RF16      0.02679     0.86483    RF  SPEAK
RF17     -0.06107     0.16987    RF  PIRRHUS
RF18      0.75687     0.00863    RF  BLEED
RF19      0.66329    -0.04879    RF  DENMARK
RF20     -0.05871     0.88416    RF  STRIKE


Variance explained by each factor

 FACTOR1     FACTOR2
4.922049    4.500412


General Linear Models Procedure

Dependent Variable: FACTOR1

Source             DF    Sum of Squares    F Value    Pr > F
Model               5        4.67947018       0.91    0.4993
Error              14       14.32052982
Corrected Total    19       19.00000000

R-Square        C.V.    FACTOR1 Mean
0.246288    -9999.99     -0.00000000

Source    DF      Type I SS    F Value    Pr > F
SEG1       1     1.06365611       1.04    0.3252
SEG2       1     0.10179871       0.10    0.7571
SEG3       1     0.54301827       0.53    0.4783
SEG4       1     0.37534247       0.37    0.5544
SEG5       1     2.59565460       2.54    0.1335

Source    DF    Type III SS    F Value    Pr > F
SEG1       1     3.15638027       3.09    0.1008
SEG2       1     3.15457139       3.08    0.1009
SEG3       1     2.95318691       2.89    0.1114
SEG4       1     2.75814377       2.70    0.1228
SEG5       1     2.59565460       2.54    0.1335

                               T for H0:                 Std Error of
Parameter        Estimate    Parameter=0    Pr > |T|         Estimate
INTERCEPT    -3.218506057          -1.43      0.1741       2.24788459
SEG1          3.463925864           1.76      0.1008       1.97192073
SEG2         -0.961725382          -1.76      0.1009       0.54764159
SEG3          0.109229981           1.70      0.1114       0.06428534
SEG4         -0.005486598          -1.64      0.1228       0.00334126
SEG5          0.000100949           1.59      0.1335       0.00006337


General Linear Models Procedure

Dependent Variable: FACTOR2

Source             DF    Sum of Squares    F Value    Pr > F
Model               4       13.09496146       8.32    0.0010
Error              15        5.90503854
Corrected Total    19       19.00000000

R-Square        C.V.    FACTOR2 Mean
0.689208    -9999.99     -0.00000000

Source    DF      Type I SS    F Value    Pr > F
SEG1       1     3.01489072       7.66    0.0144
SEG2       1     3.73036324       9.48    0.0076
SEG3       1     1.53142049       3.89    0.0673
SEG4       1     4.81828701      12.24    0.0032

Source    DF    Type III SS    F Value    Pr > F
SEG1       1     8.18482213      20.79    0.0004
SEG2       1     6.34766267      16.12    0.0011
SEG3       1     5.40347375      13.73    0.0021
SEG4       1     4.81828701      12.24    0.0032

                               T for H0:                 Std Error of
Parameter        Estimate    Parameter=0    Pr > |T|         Estimate
INTERCEPT     5.257230713           5.38      0.0001       0.97743500
SEG1         -2.785726374          -4.56      0.0004       0.61094142
SEG2          0.459201196           4.02      0.0011       0.11435676
SEG3         -0.029969728          -3.70      0.0021       0.00808932
SEG4          0.000669210           3.50      0.0032       0.00019129


Plot of HAMPF2*SEG. [Plot not reproduced.]